Document Length Normalization

Author

  • Lillian Lee
Abstract

In the previous lecture we discussed pivoted document length normalization [Singhal et al. 96], a simple technique that corrects for the observation that document relevance correlates with document length. Through careful empirical verification of previous assumptions, Singhal et al. showed that the seemingly simple normalization term could have a big impact on results. However, in our discussion of the analysis that led to pivoted document length normalization, we passed over a basic question: How were the relevance judgments in the TREC dataset made on the approximately 740,000 documents and 50 queries?
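
As a rough illustration of the kind of correction the abstract refers to, the sketch below uses the pivoted normalization factor in the form commonly attributed to Singhal et al.: (1 - slope) * pivot + slope * old_norm. The function name, the sample norms, the use of the collection average as the pivot, and the slope value are illustrative assumptions and are not taken from the lecture.

```python
def pivoted_norm(old_norm, pivot, slope):
    """Pivoted normalization factor: (1 - slope) * pivot + slope * old_norm."""
    return (1.0 - slope) * pivot + slope * old_norm


# Hypothetical "old" normalization factors (e.g. cosine norms) for three documents.
old_norms = [5.0, 10.0, 30.0]

# Assumption: use the collection average as the pivot; the slope is arbitrary here.
pivot = sum(old_norms) / len(old_norms)
slope = 0.25

new_norms = [pivoted_norm(n, pivot, slope) for n in old_norms]

for old, new in zip(old_norms, new_norms):
    print(f"old norm {old:5.2f} -> pivoted norm {new:5.2f}")
```

With a slope below 1, documents whose old norm is under the pivot receive a larger normalizer and documents above the pivot receive a smaller one, so long documents are penalized less than under the original normalization. A document exactly at the pivot is unchanged.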

Similar resources

CS 6740: Advanced Language Technologies February 4, 2010 Lecture 3: Pivoted Document Length Normalization

In this lecture, we examine the impact of the length of a document on its relevance to queries. We show that document relevance is positively correlated with document length, and see that relevance scores that use the normalization techniques we’ve studied so far (L∞, L1, L2) do not capture this correlation correctly. Finally, we present the “pivoted document length normalization” technique int...

Document Normalization Revisited

Cosine Pivoted Document Length Normalization has reached a point of stability where many researchers indiscriminately apply a specific value of 0.2 regardless of the collection. Our efforts, however, demonstrate that applying this specific value without tuning for the document collection degrades average precision by as much as 20%.

Score Normalization Methods Applied to Topic Identification

Multi-label classification plays a key role in modern categorization systems. Its goal is to find the set of labels belonging to each data item. In multi-label document classification, unlike multi-class classification, where only the best topic is chosen, the classifier must decide whether a document does or does not belong to each topic from the predefined topic set. We are using the gene...

Information Space Gets Normal

Experiments are presented based on unofficial results for TREC-7. Eigensystems analysis of a term cooccurrence matrix is compared to eigensystems analysis of a term correlation matrix. For each matrix type, the effect of term weighting and document length normalization is assessed. Recall-precision curves and other TREC statistics indicate that the use of the correlation matrix improves perform...

Improving Term Frequency Normalization for Multi-topical Documents and Application to Language Modeling Approaches

Term frequency normalization is a serious issue because document lengths vary. Generally, documents become long for two different reasons: verbosity and multi-topicality. First, verbosity means that the same topic is repeatedly mentioned by terms related to the topic, so that term frequency is higher than in a well-summarized document. Second, multi-topicality indicates that a docu...

Publication date: 2010